CMPINF 2100 Week 08¶

Motivation: example data for working with MANY variables¶

Import Modules¶

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

Penguins dataset¶

In [3]:
penguins = sns.load_dataset("penguins")
In [4]:
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
In [5]:
sns.pairplot(data=penguins)

plt.show()
[Figure: pair plot of the numeric penguins columns]

Diamonds¶

In [6]:
diamonds = sns.load_dataset("diamonds")
In [7]:
diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
In [9]:
sns.pairplot(data=diamonds)

plt.show()
[Figure: pair plot of the numeric diamonds columns]

Wine data¶

In [29]:
wine_names = ["Cultivar", "Alcohol", "Malic_acid", "Ash", "Alcalinity_of_ash", "Magnesium", "Total_phenols", 
              "Flavanoids", "Nonflavanoid_phenols", "Proanthocyanin", "Color_intensity", "Hue", "OD280_OD315", "Proline"]
In [49]:
wine_data = pd.read_csv("wine_data.txt", delimiter=",", names=wine_names)
In [50]:
wine_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Cultivar              178 non-null    int64  
 1   Alcohol               178 non-null    float64
 2   Malic_acid            178 non-null    float64
 3   Ash                   178 non-null    float64
 4   Alcalinity_of_ash     178 non-null    float64
 5   Magnesium             178 non-null    int64  
 6   Total_phenols         178 non-null    float64
 7   Flavanoids            178 non-null    float64
 8   Nonflavanoid_phenols  178 non-null    float64
 9   Proanthocyanin        178 non-null    float64
 10  Color_intensity       178 non-null    float64
 11  Hue                   178 non-null    float64
 12  OD280_OD315           178 non-null    float64
 13  Proline               178 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 19.6 KB
In [51]:
wine_data
Out[51]:
Cultivar Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanin Color_intensity Hue OD280_OD315 Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740
174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750
175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835
176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840
177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560

178 rows × 14 columns

In [52]:
sns.pairplot(data=wine_data)

plt.show()
[Figure: pair plot of the wine data columns]

Sonar data¶

In [53]:
sonar_data = pd.read_csv("sonar_alldata.txt", sep=",", header=None) 
In [54]:
sonar_data
Out[54]:
0 1 2 3 4 5 6 7 8 9 ... 51 52 53 54 55 56 57 58 59 60
0 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111 ... 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032 R
1 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872 ... 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044 R
2 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194 ... 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078 R
3 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264 ... 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117 R
4 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459 ... 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094 R
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
203 0.0187 0.0346 0.0168 0.0177 0.0393 0.1630 0.2028 0.1694 0.2328 0.2684 ... 0.0116 0.0098 0.0199 0.0033 0.0101 0.0065 0.0115 0.0193 0.0157 M
204 0.0323 0.0101 0.0298 0.0564 0.0760 0.0958 0.0990 0.1018 0.1030 0.2154 ... 0.0061 0.0093 0.0135 0.0063 0.0063 0.0034 0.0032 0.0062 0.0067 M
205 0.0522 0.0437 0.0180 0.0292 0.0351 0.1171 0.1257 0.1178 0.1258 0.2529 ... 0.0160 0.0029 0.0051 0.0062 0.0089 0.0140 0.0138 0.0077 0.0031 M
206 0.0303 0.0353 0.0490 0.0608 0.0167 0.1354 0.1465 0.1123 0.1945 0.2354 ... 0.0086 0.0046 0.0126 0.0036 0.0035 0.0034 0.0079 0.0036 0.0048 M
207 0.0260 0.0363 0.0136 0.0272 0.0214 0.0338 0.0655 0.1400 0.1843 0.2354 ... 0.0146 0.0129 0.0047 0.0039 0.0061 0.0040 0.0036 0.0061 0.0115 M

208 rows × 61 columns

In [56]:
sonar_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208 entries, 0 to 207
Data columns (total 61 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       208 non-null    float64
 1   1       208 non-null    float64
 2   2       208 non-null    float64
 3   3       208 non-null    float64
 4   4       208 non-null    float64
 5   5       208 non-null    float64
 6   6       208 non-null    float64
 7   7       208 non-null    float64
 8   8       208 non-null    float64
 9   9       208 non-null    float64
 10  10      208 non-null    float64
 11  11      208 non-null    float64
 12  12      208 non-null    float64
 13  13      208 non-null    float64
 14  14      208 non-null    float64
 15  15      208 non-null    float64
 16  16      208 non-null    float64
 17  17      208 non-null    float64
 18  18      208 non-null    float64
 19  19      208 non-null    float64
 20  20      208 non-null    float64
 21  21      208 non-null    float64
 22  22      208 non-null    float64
 23  23      208 non-null    float64
 24  24      208 non-null    float64
 25  25      208 non-null    float64
 26  26      208 non-null    float64
 27  27      208 non-null    float64
 28  28      208 non-null    float64
 29  29      208 non-null    float64
 30  30      208 non-null    float64
 31  31      208 non-null    float64
 32  32      208 non-null    float64
 33  33      208 non-null    float64
 34  34      208 non-null    float64
 35  35      208 non-null    float64
 36  36      208 non-null    float64
 37  37      208 non-null    float64
 38  38      208 non-null    float64
 39  39      208 non-null    float64
 40  40      208 non-null    float64
 41  41      208 non-null    float64
 42  42      208 non-null    float64
 43  43      208 non-null    float64
 44  44      208 non-null    float64
 45  45      208 non-null    float64
 46  46      208 non-null    float64
 47  47      208 non-null    float64
 48  48      208 non-null    float64
 49  49      208 non-null    float64
 50  50      208 non-null    float64
 51  51      208 non-null    float64
 52  52      208 non-null    float64
 53  53      208 non-null    float64
 54  54      208 non-null    float64
 55  55      208 non-null    float64
 56  56      208 non-null    float64
 57  57      208 non-null    float64
 58  58      208 non-null    float64
 59  59      208 non-null    float64
 60  60      208 non-null    object 
dtypes: float64(60), object(1)
memory usage: 99.2+ KB
In [57]:
sonar_data.shape
Out[57]:
(208, 61)

IMPORTANT: Do NOT forget about RESHAPING¶

Reshaping from WIDE to LONG format makes it much easier to explore many columns at once!

Diamonds¶

Let's reshape this DataFrame from wide to long format. We will gather/stack all numeric columns on TOP OF each other.

The non-numeric columns will not be gathered/stacked. They stay as identifier columns.

In [62]:
diamonds_numeric_names = diamonds.select_dtypes("number").columns.tolist()
In [63]:
diamonds_numeric_names
Out[63]:
['carat', 'depth', 'table', 'price', 'x', 'y', 'z']
In [64]:
diamonds_category_names = diamonds.select_dtypes("category").columns.tolist()
In [65]:
diamonds_category_names
Out[65]:
['cut', 'color', 'clarity']

Reshape from WIDE to LONG!

In [71]:
diamonds_lf = diamonds.reset_index().\
rename(columns={"index": "rowid"}).\
melt(id_vars=["rowid"]+diamonds_category_names, 
     value_vars=diamonds_numeric_names)
In [72]:
diamonds_lf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 377580 entries, 0 to 377579
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype   
---  ------    --------------   -----   
 0   rowid     377580 non-null  int64   
 1   cut       377580 non-null  category
 2   color     377580 non-null  category
 3   clarity   377580 non-null  category
 4   variable  377580 non-null  object  
 5   value     377580 non-null  float64 
dtypes: category(3), float64(1), int64(1), object(1)
memory usage: 9.7+ MB
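
As a sanity check on the reshape, the long-format row count should equal the number of original rows times the number of stacked numeric columns (53940 × 7 = 377580 here). A minimal sketch of the same pattern on a small made-up DataFrame (the `toy` frame and its values are hypothetical, not taken from the diamonds data):

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 1 categorical column, 2 numeric columns.
toy = pd.DataFrame({
    "cut": ["Ideal", "Good", "Ideal"],
    "carat": [0.23, 0.31, 0.29],
    "price": [326, 335, 334],
})

toy_numeric_names = toy.select_dtypes("number").columns.tolist()

# Same pattern as the diamonds reshape: keep a row identifier,
# then stack the numeric columns on top of each other.
toy_lf = (
    toy.reset_index()
    .rename(columns={"index": "rowid"})
    .melt(id_vars=["rowid", "cut"], value_vars=toy_numeric_names)
)

# Each original row appears once per stacked numeric column.
expected_rows = toy.shape[0] * len(toy_numeric_names)
print(toy_lf.shape[0], expected_rows)
```

The rowid column is what lets us trace each long-format row back to its original wide-format row.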

We can now map the variable column of the LONG format data to the col argument to create COLUMN FACETS, one facet per original numeric column!

In [73]:
diamonds_lf.variable.value_counts()
Out[73]:
variable
carat    53940
depth    53940
table    53940
price    53940
x        53940
y        53940
z        53940
Name: count, dtype: int64
In [76]:
sns.displot(data=diamonds_lf, x="value", col="variable", col_wrap=3, 
            kind="hist",
            facet_kws={"sharex": False, "sharey": False},
            common_bins=False)

plt.show()
[Figure: histograms of value, one facet per numeric diamonds variable]

We can also study the CONDITIONAL DISTRIBUTIONS of the numeric columns GIVEN (GROUPED BY) a categorical variable!

In [81]:
sns.catplot(data=diamonds_lf, x="color", y="value", col="variable", col_wrap=3,
            kind="box", sharey=False, hue="color", palette="coolwarm")

plt.show()
[Figure: box plots of value by color, one facet per numeric diamonds variable]

We can study the CONDITIONAL MEANS, that is, the AVERAGE per GROUP!

In [82]:
sns.catplot(data=diamonds_lf, x="color", y="value", col="variable", col_wrap=3,
            kind="point", sharey=False, hue="color", palette="coolwarm")

plt.show()
[Figure: point plots of value by color, one facet per numeric diamonds variable]
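
The point plot displays one mean per group in each facet. If we want the numbers themselves, the same conditional averages can be computed from long-format data with `groupby()`. A minimal sketch on made-up values (the `toy_lf` frame and the `avg_value` column name are hypothetical):

```python
import pandas as pd

# Hypothetical long-format data: one row per (rowid, variable) pair.
toy_lf = pd.DataFrame({
    "rowid": [0, 1, 2, 0, 1, 2],
    "color": ["D", "E", "D", "D", "E", "D"],
    "variable": ["carat", "carat", "carat", "price", "price", "price"],
    "value": [0.2, 0.3, 0.4, 300.0, 350.0, 500.0],
})

# Average of each original numeric column GIVEN the color group.
group_means = (
    toy_lf.groupby(["variable", "color"])["value"]
    .mean()
    .reset_index(name="avg_value")
)
print(group_means)
```

Grouping by both variable and the categorical column mirrors what the facets (col) and the x-axis do in the figure.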

Always examine the SHAPE of the distribution, not only its average!

In [83]:
sns.catplot(data=diamonds_lf, x="color", y="value", col="variable", col_wrap=3,
            kind="violin", sharey=False, hue="color", palette="coolwarm")

plt.show()
[Figure: violin plots of value by color, one facet per numeric diamonds variable]

Sonar¶

In [84]:
sonar_data.shape
Out[84]:
(208, 61)
In [85]:
sonar_data.dtypes.value_counts()
Out[85]:
float64    60
object      1
Name: count, dtype: int64
In [88]:
sonar_data.dtypes
Out[88]:
0     float64
1     float64
2     float64
3     float64
4     float64
       ...   
56    float64
57    float64
58    float64
59    float64
60     object
Length: 61, dtype: object
In [89]:
sonar_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208 entries, 0 to 207
Data columns (total 61 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       208 non-null    float64
 1   1       208 non-null    float64
 2   2       208 non-null    float64
 3   3       208 non-null    float64
 4   4       208 non-null    float64
 5   5       208 non-null    float64
 6   6       208 non-null    float64
 7   7       208 non-null    float64
 8   8       208 non-null    float64
 9   9       208 non-null    float64
 10  10      208 non-null    float64
 11  11      208 non-null    float64
 12  12      208 non-null    float64
 13  13      208 non-null    float64
 14  14      208 non-null    float64
 15  15      208 non-null    float64
 16  16      208 non-null    float64
 17  17      208 non-null    float64
 18  18      208 non-null    float64
 19  19      208 non-null    float64
 20  20      208 non-null    float64
 21  21      208 non-null    float64
 22  22      208 non-null    float64
 23  23      208 non-null    float64
 24  24      208 non-null    float64
 25  25      208 non-null    float64
 26  26      208 non-null    float64
 27  27      208 non-null    float64
 28  28      208 non-null    float64
 29  29      208 non-null    float64
 30  30      208 non-null    float64
 31  31      208 non-null    float64
 32  32      208 non-null    float64
 33  33      208 non-null    float64
 34  34      208 non-null    float64
 35  35      208 non-null    float64
 36  36      208 non-null    float64
 37  37      208 non-null    float64
 38  38      208 non-null    float64
 39  39      208 non-null    float64
 40  40      208 non-null    float64
 41  41      208 non-null    float64
 42  42      208 non-null    float64
 43  43      208 non-null    float64
 44  44      208 non-null    float64
 45  45      208 non-null    float64
 46  46      208 non-null    float64
 47  47      208 non-null    float64
 48  48      208 non-null    float64
 49  49      208 non-null    float64
 50  50      208 non-null    float64
 51  51      208 non-null    float64
 52  52      208 non-null    float64
 53  53      208 non-null    float64
 54  54      208 non-null    float64
 55  55      208 non-null    float64
 56  56      208 non-null    float64
 57  57      208 non-null    float64
 58  58      208 non-null    float64
 59  59      208 non-null    float64
 60  60      208 non-null    object 
dtypes: float64(60), object(1)
memory usage: 99.2+ KB

I do not like to use NUMBERS as column names!

Let's change them using a LIST COMPREHENSION!

Let's change the column names to the pattern X00, X01, X02, ...

In [91]:
"X%02d" % 0
Out[91]:
'X00'
In [92]:
"X%02d" % 10
Out[92]:
'X10'
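
The `%` operator used here is the older printf-style formatting. As an aside, the same zero-padded pattern can be written with an f-string. The sketch below assumes integer column labels running from 0 to 60, as in the sonar data:

```python
# Zero-padded names with an f-string: width 2, padded with leading zeros.
new_names = [f"X{d:02d}" for d in range(61)]

# Show the first few names and the last one.
print(new_names[:3], new_names[-1])
```

Zero padding keeps the names in the expected order when they are sorted as strings (X02 sorts before X10).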
In [93]:
["X%02d" % d for d in sonar_data.columns]
Out[93]:
['X00',
 'X01',
 'X02',
 'X03',
 'X04',
 'X05',
 'X06',
 'X07',
 'X08',
 'X09',
 'X10',
 'X11',
 'X12',
 'X13',
 'X14',
 'X15',
 'X16',
 'X17',
 'X18',
 'X19',
 'X20',
 'X21',
 'X22',
 'X23',
 'X24',
 'X25',
 'X26',
 'X27',
 'X28',
 'X29',
 'X30',
 'X31',
 'X32',
 'X33',
 'X34',
 'X35',
 'X36',
 'X37',
 'X38',
 'X39',
 'X40',
 'X41',
 'X42',
 'X43',
 'X44',
 'X45',
 'X46',
 'X47',
 'X48',
 'X49',
 'X50',
 'X51',
 'X52',
 'X53',
 'X54',
 'X55',
 'X56',
 'X57',
 'X58',
 'X59',
 'X60']
In [94]:
sonar_data.columns = ["X%02d" % d for d in sonar_data.columns]
In [96]:
sonar_data.columns
Out[96]:
Index(['X00', 'X01', 'X02', 'X03', 'X04', 'X05', 'X06', 'X07', 'X08', 'X09',
       'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19',
       'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29',
       'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39',
       'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49',
       'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59',
       'X60'],
      dtype='object')
In [97]:
sonar_data["X00"]
Out[97]:
0      0.0200
1      0.0453
2      0.0262
3      0.0100
4      0.0762
        ...  
203    0.0187
204    0.0323
205    0.0522
206    0.0303
207    0.0260
Name: X00, Length: 208, dtype: float64

We need to extract the numeric column names.

In [98]:
sonar_numeric_names = sonar_data.select_dtypes("number").columns.to_list()
In [100]:
sonar_numeric_names
Out[100]:
['X00',
 'X01',
 'X02',
 'X03',
 'X04',
 'X05',
 'X06',
 'X07',
 'X08',
 'X09',
 'X10',
 'X11',
 'X12',
 'X13',
 'X14',
 'X15',
 'X16',
 'X17',
 'X18',
 'X19',
 'X20',
 'X21',
 'X22',
 'X23',
 'X24',
 'X25',
 'X26',
 'X27',
 'X28',
 'X29',
 'X30',
 'X31',
 'X32',
 'X33',
 'X34',
 'X35',
 'X36',
 'X37',
 'X38',
 'X39',
 'X40',
 'X41',
 'X42',
 'X43',
 'X44',
 'X45',
 'X46',
 'X47',
 'X48',
 'X49',
 'X50',
 'X51',
 'X52',
 'X53',
 'X54',
 'X55',
 'X56',
 'X57',
 'X58',
 'X59']
In [103]:
sonar_category_names = sonar_data.select_dtypes("object").columns.to_list()
In [104]:
sonar_category_names
Out[104]:
['X60']
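
Note that the non-numeric columns were selected with "category" for diamonds but with "object" here, because the two datasets store them differently. One possible alternative, sketched below on a hypothetical `toy` frame, is to ask for everything that is NOT numeric with the `exclude` argument of `select_dtypes()`:

```python
import pandas as pd

# Hypothetical toy data with one numeric, one category, and one string column.
toy = pd.DataFrame({
    "X00": [0.02, 0.05],
    "cut": pd.Categorical(["Ideal", "Good"]),
    "X60": ["R", "M"],
})

# Numeric columns to stack, and every other column to keep as identifiers.
numeric_names = toy.select_dtypes(include="number").columns.to_list()
non_numeric_names = toy.select_dtypes(exclude="number").columns.to_list()

print(numeric_names, non_numeric_names)
```

This way the two lists always partition the columns, regardless of how the non-numeric columns are stored.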

RESHAPE from WIDE to LONG format!!

In [108]:
sonar_lf = sonar_data.reset_index().\
rename(columns={"index":"rowid"}).\
melt(id_vars=["rowid"]+sonar_category_names, value_vars=sonar_numeric_names)
In [109]:
sonar_lf
Out[109]:
rowid X60 variable value
0 0 R X00 0.0200
1 1 R X00 0.0453
2 2 R X00 0.0262
3 3 R X00 0.0100
4 4 R X00 0.0762
... ... ... ... ...
12475 203 M X59 0.0157
12476 204 M X59 0.0067
12477 205 M X59 0.0031
12478 206 M X59 0.0048
12479 207 M X59 0.0115

12480 rows × 4 columns

In [110]:
sonar_lf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12480 entries, 0 to 12479
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   rowid     12480 non-null  int64  
 1   X60       12480 non-null  object 
 2   variable  12480 non-null  object 
 3   value     12480 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 390.1+ KB
In [112]:
sonar_lf.variable.value_counts()
Out[112]:
variable
X00    208
X01    208
X32    208
X33    208
X34    208
X35    208
X36    208
X37    208
X38    208
X39    208
X40    208
X41    208
X42    208
X43    208
X44    208
X45    208
X46    208
X47    208
X48    208
X49    208
X50    208
X51    208
X52    208
X53    208
X54    208
X55    208
X56    208
X57    208
X58    208
X31    208
X30    208
X29    208
X14    208
X02    208
X03    208
X04    208
X05    208
X06    208
X07    208
X08    208
X09    208
X10    208
X11    208
X12    208
X13    208
X15    208
X28    208
X16    208
X17    208
X18    208
X19    208
X20    208
X21    208
X22    208
X23    208
X24    208
X25    208
X26    208
X27    208
X59    208
Name: count, dtype: int64

We can now use Seaborn to create a FACET for each unique value of variable, which lets us examine the original wide-format numeric columns!

In [120]:
sns.displot(data=sonar_lf, x="value", col="variable", col_wrap=5,
            facet_kws={"sharex": False, "sharey": False}, common_bins=False,
            kind="hist")

plt.show()
[Figure: histograms of value, one facet per sonar variable X00 to X59]

We can also create the CONDITIONAL KDE, where we COLOR by the OBJECT COLUMN (X60)!

In [122]:
sns.displot(data=sonar_lf, x="value", col="variable", hue="X60", col_wrap=5,
            facet_kws={"sharex": False, "sharey": False}, common_norm=False,
            kind="kde")

plt.show()
[Figure: KDE plots of value colored by X60, one facet per sonar variable]

We can also use VIOLINS to compare the CONDITIONAL distribution SHAPES.

In [124]:
sns.catplot(data=sonar_lf, y="value", col="variable", x="X60", col_wrap=5,
            sharey=False, hue="X60",
            kind="violin")

plt.show()
[Figure: violin plots of value by X60, one facet per sonar variable]
In [127]:
sns.catplot(data=sonar_lf, y="value", col="variable", x="X60", col_wrap=5,
            sharey=False, hue="X60", linestyle="none",
            kind="point")

plt.show()
[Figure: point plots of value by X60, one facet per sonar variable]

BUT there is something unique about this dataset... to see it, let's use Seaborn's WIDE FORMAT plotting...

In [128]:
sns.catplot(data=sonar_data, kind="box", aspect=3.5)

plt.show()
[Figure: wide-format box plots, one box per sonar column]
In [131]:
sns.catplot(data=sonar_data, kind="point", aspect=3.5, linestyle="none")

plt.show()
[Figure: wide-format point plot, one point per sonar column]

The WIDE format does NOT let us GROUP BY categoricals.

The LONG format lets us GROUP BY categorical variables and associate FIGURE elements with the numeric columns!

In [133]:
sns.catplot(data=sonar_lf, x="variable", y="value", kind="box", aspect=3.5, hue="variable")

plt.show()
[Figure: long-format box plots of value, one box per sonar variable]

GROUP BY X60.

In [134]:
sns.catplot(data=sonar_lf, x="variable", y="value", kind="box", aspect=3.5, hue="X60")

plt.show()
[Figure: box plots of value per sonar variable, colored by X60]
In [136]:
sns.catplot(data=sonar_lf, x="variable", y="value", kind="point", aspect=3.5, hue="X60", linestyle="none")

plt.show()
[Figure: point plot of value per sonar variable, colored by X60]

Summary¶

If we have fewer than about 10 numeric variables, we can use point plots.

If we have more than about 10 numeric variables, point plots become hard to read, which motivates Cluster Analysis and PCA.

We have to RESHAPE the data from WIDE to LONG format in order to explore the numeric columns through FACETS.
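
The same reshape was written out twice, once for diamonds and once for sonar. One way to avoid repeating it is a small helper function. The sketch below is only an illustration: `wide_to_long` and its `id_name` argument are hypothetical names (not part of pandas or Seaborn), and it assumes the DataFrame has a default unnamed index, as both datasets here do.

```python
import pandas as pd

def wide_to_long(df, id_name="rowid"):
    """Stack all numeric columns; keep the other columns as identifiers.

    Hypothetical helper that assumes a default, unnamed index on df.
    """
    numeric_names = df.select_dtypes("number").columns.tolist()
    other_names = [c for c in df.columns if c not in numeric_names]
    return (
        df.reset_index()
        .rename(columns={"index": id_name})
        .melt(id_vars=[id_name] + other_names, value_vars=numeric_names)
    )

# Hypothetical toy data shaped like a tiny slice of the sonar data.
toy = pd.DataFrame({
    "X00": [0.02, 0.05],
    "X01": [0.04, 0.05],
    "X60": ["R", "M"],
})

toy_lf = wide_to_long(toy)
print(toy_lf)
```

The result has the same four-column layout (rowid, the categorical column, variable, value) that the facet plots above rely on.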

In [ ]: